<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<title>Removing restrictions on valid characters in paths</title>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
    <meta http-equiv="Content-Style-Type" content="text/css">
    <link href="http://dev.eclipse.org/default_style.css" rel="stylesheet" type="text/css">
</head>
<body bgcolor="#ffffff">

<table width="100%">
<tr><td style="background:#0080C0"><b><span style="color:white">Proposal</span></b></td></tr>
</table>
 
<h1>Removing restrictions on valid characters in paths</h1>
 
<blockquote> <b>Summary</b>  <br>
The data structure <tt>org.eclipse.core.runtime.IPath</tt> and its 
canonical implementation, <tt>org.eclipse.core.runtime.Path</tt>, impose
restrictions on segment names that can be more restrictive
than those for file names in the underlying file system. When Eclipse paths
are used to represent file system paths, these restrictions prevent 
valid files from being added to the Eclipse workspace. This document
describes the current set of restrictions in Eclipse 3.0, and proposes changes to
lift these restrictions.
<p>
<font size="-1">Last modified: October 18, 2004</font> 
</blockquote>
<p>
<h3>Background</h3>
<p>
<tt>IPath</tt> is an abstract data structure supplied by the <tt>org.eclipse.core.runtime</tt>
plug-in, and consists of the following parts:
<ul>
<li>An optional device (a <tt>String</tt>)</li>
<li>Zero or more ordered segments (represented as <tt>String</tt>)</li>
<li>Optional leading, trailing, and UNC separators</li>
</ul>
<p>
To facilitate conversion between <tt>IPath</tt> and <tt>String</tt>
instances, <tt>IPath</tt> reserves the colon (':') character as the device
delimiter, and the forward and back slash ('/' and '\') characters as segment delimiters.
The API javadoc for <tt>IPath.isValidSegment</tt> outlines the complete set
of restrictions:
 <ul>
 <li> the empty string is not valid
 <li> any string containing the colon character (":") is not valid
 <li> any string containing the slash character ("/") is not valid
 <li> any string containing the backslash character ("\") is not valid
 <li> any string starting or ending with a whitespace character is not valid
 <li> all other strings are valid
 </ul>
Contrast this with the restrictions on file names in Unix (as defined in the "Base Definitions" volume 
of IEEE Std. 1003.1-2001, section 3.169 Filename):
 <ul>
 <li> the empty string is not valid </li>
 <li> any string containing the slash character ('/') is not valid</li>
 <li> the strings "." and ".." have special meaning</li>
 <li> any string containing the null byte is not valid</li>
 <li> all other strings are valid</li>
 </ul>
Thus when Eclipse <tt>IPath</tt> objects are used to represent Unix file names,
they are unable to represent file names containing '\' or ':', or file names with leading
or trailing white space.
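<p>
The two rule sets above can be compared side by side with a small standalone sketch.
The class and method names below (<tt>SegmentRules</tt>, <tt>isValidEclipse30Segment</tt>,
<tt>isValidPosixFilename</tt>) are hypothetical and exist only for illustration; they restate
the two bulleted lists and are not the actual <tt>IPath.isValidSegment</tt> implementation.
</p>

```java
public class SegmentRules {

    // Restates the Eclipse 3.0 rules listed above for IPath.isValidSegment.
    public static boolean isValidEclipse30Segment(String s) {
        if (s.isEmpty()) {
            return false;
        }
        if (s.indexOf(':') != -1 || s.indexOf('/') != -1 || s.indexOf('\\') != -1) {
            return false;
        }
        // Leading or trailing whitespace is rejected in Eclipse 3.0.
        return !Character.isWhitespace(s.charAt(0))
                && !Character.isWhitespace(s.charAt(s.length() - 1));
    }

    // Restates the Unix filename rules listed above.
    // "." and ".." are legal names with special meaning, so they pass here.
    public static boolean isValidPosixFilename(String s) {
        return !s.isEmpty() && s.indexOf('/') == -1 && s.indexOf('\0') == -1;
    }

    public static void main(String[] args) {
        String[] names = {"file.txt", "4:25:12PM", "a\\b", " padded ", "a/b"};
        for (String name : names) {
            System.out.println("[" + name + "] eclipse3.0="
                    + isValidEclipse30Segment(name)
                    + " posix=" + isValidPosixFilename(name));
        }
    }
}
```

<p>
Of the five sample names, only "file.txt" passes both checks. The names containing ':' or '\',
and the name padded with spaces, are accepted by the Unix rules but rejected by the Eclipse 3.0 rules.
</p>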

<h3>Proposed Solution</h3>
<p>
Lifting the restriction on paths with leading or trailing whitespace and paths
containing the '\' character is easily achieved by specifying a new constructor for 
creation of paths.  Lifting the restriction on the ':' character is more
difficult, since it is needed as the device delimiter on operating systems that 
support a device.
<p>
The solution must accommodate the two interesting categories 
of <tt>IPath</tt> users:
<ul>
<li>Code that needs a platform-specific representation. In other words, someone that
actually wants to open a stream on a file, delete a file, etc. They essentially need to
translate between an <tt>IPath</tt> and a <tt>java.io.File</tt>.</li>
<li>Code that is reading, storing, and manipulating file system paths, but does 
not need the platform-specific manifestation. In particular, there is a strong need
for a serialized representation of paths in a platform-neutral way so that they can 
be later read and interpreted on a different platform (typically in the form of a path
that is relative to some platform specific prefix represented by a variable).
</ul>
<tt>IPath</tt> already acknowledges these two uses in its <tt>toString</tt> methods.
The standard <tt>toString</tt> method creates a platform-neutral encoding of
the path as a <tt>String</tt>. The <tt>toOSString</tt> method creates a platform-specific
encoding suitable for passing to <tt>java.io.File</tt> or other API that deals directly
with the file system.
<p>
The proposed solution is to introduce two factory methods for creating <tt>IPath</tt>
instances that perform the inverse of the two existing <tt>toString</tt> methods:
<ul>
<li><code>Path.fromOSString</code>: A factory method that decodes a platform-specific string.  
For example, this will parse the output of a previous call to <tt>IPath.toOSString</tt>, 
or the value returned by <tt>java.io.File.getAbsolutePath</tt>.</li>
<li><code>Path.fromPortableString</code>: A factory method that decodes a platform-neutral 
string, such as the output of a previous call to <tt>IPath.toPortableString</tt>.</li>
</ul>
<p>
Since changing the behaviour of the existing <code>toString</code>
method would cause too much breakage, a new method, <code>toPortableString</code>,
will be introduced for creating a platform-neutral string representation of paths.
The existing <code>toString</code> method will remain unchanged.
<p>
Most clients will use the platform-specific form of paths. The path
can be converted to/from a platform-neutral representation when a path
needs to be serialized in a portable fashion.
<p>
The platform-neutral encoding of paths (<code>IPath.toPortableString</code>) will allow
all characters except slash ('/') in segment names, and include an optional device 
separated from the segments by a single colon character. Literal colon characters in path
segments are escaped through doubling (one colon becomes two colons).
The following are some examples of Windows file system paths and the 
corresponding platform-neutral encoding:
<ul>
<li>"C:\folder\file.txt" becomes "C:/folder/file.txt"</li>
<li>"C:folder\file.txt" becomes "C:folder/file.txt"</li>
<li>"C:\folder\" becomes "C:/folder/"</li>
</ul>
Canonical Unix paths look identical to their platform-neutral encoding, except
in the presence of segments containing the colon character. The following are
some Unix paths and the corresponding encoding by <tt>IPath.toPortableString</tt>:
<ul>
<li>"/etc/" is encoded as "/etc/"</li>
<li>"/etc/passwd" is encoded as "/etc/passwd"</li>
<li>"/etc/timeNowIs4:25:12PM" is encoded as "/etc/timeNowIs4::25::12PM"</li>
<li>"c:/folder/file.txt" is encoded as "c::/folder/file.txt"</li>
</ul>
<p>
 UNC paths, which typically have no device but have a double leading separator,
 are generally encoded unchanged:
<ul>
<li>"//Server/Volume" becomes "//Server/Volume"</li>
<li>"//Server/TimeIs4:25:12PM" becomes "//Server/TimeIs4::25::12PM"</li>
</ul>
If for some reason a UNC path has a device, the device will precede the slashes:
<ul>
<li>"C://Server/Volume" becomes "C://Server/Volume"</li>
<li>"C://Server/TimeIs4:25:12PM" becomes "C://Server/TimeIs4::25::12PM"</li>
</ul>
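<p>
The encoding rule illustrated by the examples above can be sketched as a standalone method.
This is only an illustration under stated assumptions: <tt>PortableEncoder</tt> and
<tt>toPortable</tt> are hypothetical names, the path is passed in already split into its parts,
and the device (when present) is assumed to include its trailing colon. It is not the
<tt>IPath.toPortableString</tt> implementation.
</p>

```java
public class PortableEncoder {

    /**
     * Builds the platform-neutral string for a path given as separate parts.
     * device may be null; when present it includes its trailing colon ("C:").
     */
    public static String toPortable(String device, String[] segments,
            boolean absolute, boolean unc, boolean trailingSeparator) {
        StringBuilder out = new StringBuilder();
        if (device != null) {
            out.append(device);          // the device colon is written once
        }
        if (unc) {
            out.append("//");            // UNC paths keep the double separator
        } else if (absolute) {
            out.append('/');
        }
        for (int i = 0; i != segments.length; i++) {
            if (i != 0) {
                out.append('/');
            }
            // Literal colons inside a segment are escaped by doubling.
            out.append(segments[i].replace(":", "::"));
        }
        if (trailingSeparator && segments.length != 0) {
            out.append('/');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // prints /etc/timeNowIs4::25::12PM
        System.out.println(toPortable(null,
                new String[] {"etc", "timeNowIs4:25:12PM"}, true, false, false));
        // prints c::/folder/file.txt
        System.out.println(toPortable(null,
                new String[] {"c:", "folder", "file.txt"}, false, false, false));
        // prints C://Server/Volume
        System.out.println(toPortable("C:",
                new String[] {"Server", "Volume"}, true, true, false));
    }
}
```

<p>
Because only segment text is escaped, a single colon in the output can only come from the device,
which is what allows the string to be decoded unambiguously.
</p>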
<p>
This platform-neutral encoding unambiguously encodes all possible paths on
all supported platforms. Most importantly, this <tt>toPortableString</tt> implementation
is fully backward compatible with the Eclipse 3.0 implementation of <tt>IPath.toString</tt>
for all paths that can be created in Eclipse 3.0. This means that clients who
previously used <code>toString</code> for serializing paths can move to the
new <code>toPortableString/fromPortableString</code> methods without
migrating file formats.
<p>
The platform-specific Path factory method will impose the minimum platform-specific
requirements needed to unambiguously parse all possible paths on that platform. 
The Windows implementation, for example, will interpret everything up to the
first ':' as the device, and treat both '/' and '\' as path segment separators. No
other rules will be imposed. Thus the existing restriction on paths that prevents
path segments from having leading or trailing whitespace will no longer be enforced
on any platform.
<p>
As before, detailed validation of all legal characters and names on that platform 
will not be enforced. Some clients use technology such as Cygwin or Samba to 
mount foreign file systems on a platform. In these situations, path name rules
for the local file system do not apply. While it is difficult to fully support these users,
any additional platform-specific verification performed on paths causes further
problems for these users. Imposing the absolute minimum requirements for
unambiguously parsing paths allows the majority of users to function without
further impacting the corner cases.
</p>

<h3>API Details</h3>
<p>
The following existing methods on <tt>IPath</tt> and <tt>Path</tt> are affected:
<ul>
<li><tt>isValidPath/isValidSegment</tt>. The implementation of these 
methods will change to match the <code>fromOSString</code> factory 
method. In other words, path validity becomes a platform-specific issue. 
The specification will change to a more general wording stating only that 
certain characters are reserved on some operating systems. In implementation, 
it will just check for the device separator on operating systems that require it.
The restriction on leading and trailing spaces in segment names will be 
removed on all operating systems that allow such paths.</li>
<li><tt>Path(String, String)</tt>. This constructor previously had the unspecified
behaviour of extracting the device from the second argument when no device
argument was supplied. This behaviour will change to no longer parse the 
device from the second argument. This constructor will now handle literal
colon characters in path segments on all platforms.</li>
<li><tt>IPath.append(String)</tt>. This method implicitly constructs or interprets
a path literal from the provided parameter.  This method will use the OS-specific
rules when interpreting the provided path ('\' and ':' treated as segment and
device delimiters on Windows only).</li>
</ul>
New methods for Eclipse 3.1:
<ul>
<li><tt>Path.fromPortableString</tt>. A factory method for producing an <tt>IPath</tt>
given a platform-neutral encoding of a string. This is the inverse of the
<tt>IPath.toPortableString</tt> method. In particular, double colons will be interpreted
as single colon characters in segment names, and the first single colon (if any) 
will be treated as the device separator.</li>
<li><tt>toPortableString</tt>. A method for producing a platform-neutral
encoding of a path as a string, suitable for storing in files that need to
be platform-neutral.  This encoding will escape literal colons in path segments 
using a double colon.</li>
<li><tt>Path.fromOSString</tt>. A factory method for producing an <tt>IPath</tt>
given a platform-specific encoding of a string. This is the inverse of the
<tt>IPath.toOSString</tt> method. On Windows the colon character is treated
as the device separator, and both varieties of slash are treated as segment separators.</li>
</ul>
</p>
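<p>
The decoding rule described for <tt>Path.fromPortableString</tt> can be sketched as follows.
The names <tt>PortableDecoder</tt>, <tt>Parsed</tt> and <tt>fromPortable</tt> are hypothetical,
and <tt>Parsed</tt> is a minimal stand-in rather than a real <tt>IPath</tt>. The sketch assumes
that only the first colon in the string needs to be inspected to find the device separator, and
it ignores UNC and trailing separators; both are simplifications, not statements about the
actual implementation.
</p>

```java
import java.util.ArrayList;
import java.util.List;

public class PortableDecoder {

    /** Minimal stand-in for a parsed path; hypothetical, not the real IPath. */
    public static final class Parsed {
        public String device;      // null when absent, otherwise ends with ':'
        public boolean absolute;   // true when the path part starts with '/'
        public String[] segments;  // segment names with "::" contracted to ":"
    }

    public static Parsed fromPortable(String s) {
        Parsed p = new Parsed();
        String rest = s;
        int colon = s.indexOf(':');
        // Assumption: a first colon that is immediately followed by another
        // colon is an escaped literal, so the path has no device.
        boolean doubled = colon != -1
                && colon + 1 != s.length()
                && s.charAt(colon + 1) == ':';
        if (colon != -1 && !doubled) {
            p.device = s.substring(0, colon + 1);
            rest = s.substring(colon + 1);
        }
        p.absolute = rest.startsWith("/");
        List<String> out = new ArrayList<String>();
        // Only '/' separates segments in the platform-neutral form.
        for (String raw : rest.split("/")) {
            if (!raw.isEmpty()) {
                out.add(raw.replace("::", ":"));
            }
        }
        p.segments = out.toArray(new String[0]);
        return p;
    }

    public static void main(String[] args) {
        Parsed p = fromPortable("C::/foo");
        // prints null false 2
        System.out.println(p.device + " " + p.absolute + " " + p.segments.length);
    }
}
```

<p>
Under these assumptions, "C:/foo" yields device "C:" with the single segment "foo", while
"C::/foo" yields no device and the two segments "C:" and "foo".
</p>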
<h3>What do we do with the <tt>Path(String)</tt> constructor?</h3>
<p>
This proposal introduces two factory methods that clearly distinguish
platform-neutral and platform-specific encodings of paths.  The difficult
question is what to do with the old single-argument <tt>Path</tt> constructor. The two options are:
<ol>
<li>
Leave the implementation of this constructor unchanged, but deprecate
it.  The advantage of this solution is that it does not break the API contract
spelled out in the current <code>Path</code> constructor, which explicitly
states how it handles ':' and '\' characters.  The disadvantage is that this will 
require all callers of the existing <tt>Path</tt> constructors 
to migrate to one of the two path factory methods, depending on the origin 
of the path string being used.  Clients that do not migrate to the new factory
methods risk errors introduced when trying to construct <code>IPath</code>
instances corresponding to file system paths that were previously treated as
invalid. For example, the resources plug-in would allow introduction of
resources with the ':' and '\' characters.  Other plug-ins trying to create
a path corresponding to those resources using the old constructors will
fail.  Experiments with this solution showed that plug-ins that failed to
migrate to the new factory method were broken due to the unexpected
introduction of previously invalid paths.  This presents a bleak picture
for backwards-compatibility, regardless of the fact that no API contracts
are broken.
</li><li>
The second option is to change the existing single-argument <tt>Path</tt> constructor to be
platform-specific. In other words, the Windows implementation of these
methods would remain unchanged, but implementations on other platforms
would stop treating ':' as the device separator, and no longer treat '\'
as a path segment separator.  This clearly violates the existing API
specification of the <code>Path</code> constructor. On the positive
side, this introduces very little breakage in practice. The net effect
is of removing old restrictions on some operating systems. The only
breakage will be caused to clients who use a device for some reason
on all operating systems, and clients that need to construct <code>IPath</code>
objects representing file system paths from platforms other than the one
that the current Eclipse instance is running on. For example, a plug-in running
on Linux would not be able to use the old constructors to create <code>IPath</code>
objects representing files from a remote Windows system.
</li></ol>
<p>
After investigating the implementation of both of the above approaches,
the second option introduces the smallest breakage by far. For example,
the first option requires changing almost all of the 600 references to the <code>Path</code>
constructors found in the current edition of the Eclipse platform. The second
option requires only a small set of localized changes in code that deals
with serializing and deserializing paths in a platform-neutral manner. Based
on testing the implementation of these two options, this proposal recommends
option two.
<h3>Examples</h3>
<p>
The following examples illustrate the behaviour of the various <tt>Path</tt>
constructors and <tt>to*String</tt> methods.
<p>
Given the absolute path with device "C:" and single segment "foo", the following
<tt>IPath</tt> methods will produce:
<ul>
<li>toString -> "C:/foo"</li>
<li>toPortableString -> "C:/foo" </li>
<li>toOSString (Windows) -> "C:\foo" </li>
<li>toOSString (Linux) -> "C:/foo" </li>
<li>getDevice -> "C:"</li>
</ul>
Given the relative path with null device, and two segments "C:" and "foo":
<ul>
<li>toString -> "C:/foo"</li>
<li>toPortableString -> "C::/foo"</li>
<li>toOSString (Windows) -> "C:\foo"</li>
<li>toOSString (Linux) -> "C:/foo"</li>
<li>getDevice -> (null)</li>
</ul>
Given the string "C:\\foo" (single backslash escaped in Java literal format), the following constructors will produce:
<ul>
<li>fromOSString and Path(String) (Windows) -> Absolute path with device "C:" and single segment "foo"
<li>fromOSString and Path(String) (Linux) -> Relative path with null device and single segment "C:\foo"
<li>fromPortableString -> Relative path with device "C:" and single segment "\foo"
</ul>
Given the string "C:/foo":
<ul>
<li>fromOSString and Path(String) (Windows) -> Absolute path with device "C:" and single segment "foo"
<li>fromOSString and Path(String) (Linux) -> Relative path with null device and two segments "C:" and "foo"
<li>fromPortableString -> Absolute path with device "C:" and single segment "foo"
</ul>
Given the string "C::/foo":
<ul>
<li>fromOSString and Path(String) (Windows) -> Relative path with device "C:" and two segments ":" and "foo" (an invalid path)
<li>fromOSString and Path(String) (Linux) -> Relative path with null device and two segments "C::" and "foo"
<li>fromPortableString -> Relative path with null device and two segments "C:" and "foo"
</ul>
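<p>
The platform-specific columns in the examples above can be reproduced with a small sketch of the
parsing rules. <tt>OsPathParser</tt> and <tt>describe</tt> are hypothetical names, and the
<tt>windows</tt> flag stands in for the platform the code would actually be running on; this is
an illustration of the rules as described in this proposal, not the <tt>Path.fromOSString</tt>
implementation.
</p>

```java
public class OsPathParser {

    /** Returns a readable summary of how the string would be interpreted. */
    public static String describe(String s, boolean windows) {
        String device = null;
        String rest = s;
        if (windows) {
            // On Windows, everything up to the first ':' is the device...
            int colon = s.indexOf(':');
            if (colon != -1) {
                device = s.substring(0, colon + 1);
                rest = s.substring(colon + 1);
            }
            // ...and both '/' and '\' separate segments.
            rest = rest.replace('\\', '/');
        }
        // On other platforms only '/' is special; ':' and '\' are literal.
        boolean absolute = rest.startsWith("/");
        StringBuilder segs = new StringBuilder();
        for (String seg : rest.split("/")) {
            if (seg.isEmpty()) {
                continue;
            }
            if (segs.length() != 0) {
                segs.append(", ");
            }
            segs.append('"').append(seg).append('"');
        }
        return "device=" + device + " absolute=" + absolute
                + " segments=[" + segs + "]";
    }

    public static void main(String[] args) {
        // prints device=C: absolute=true segments=["foo"]
        System.out.println(describe("C:\\foo", true));
        // prints device=null absolute=false segments=["C:\foo"]
        System.out.println(describe("C:\\foo", false));
        // prints device=C: absolute=false segments=[":", "foo"]
        System.out.println(describe("C::/foo", true));
    }
}
```

<p>
Passing the platform as a parameter keeps the sketch testable on any machine; a real
implementation would presumably decide this once from the running platform.
</p>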
<h3>Other migration issues</h3>
<p>
All clients who store absolute <tt>IPath</tt> objects as platform-neutral 
strings in a serialized form (as produced by <tt>IPath.toString</tt> in Eclipse 3.0), 
should switch to the new <tt>fromPortableString/toPortableString</tt> methods 
rather than the <tt>Path</tt> constructor and the <tt>toString</tt> method. 
Backward compatibility with files written by Eclipse 3.0 is automatic 
(no changes to file formats or file format version numbers are required).
Examples of files that contain string representations of paths that will 
need to migrate include the workspace <tt>.project</tt> and <tt>.classpath</tt> files.
</p>
<h3>Other Observations</h3>
<p>
Under this proposal <tt>IPath.toPortableString</tt> and <tt>Path.fromPortableString</tt>
are perfect inverses of each other. In other words, the expression
<pre>
path.equals(Path.fromPortableString(path.toPortableString()))
</pre>
will be true for all paths, and
<pre>
string.equals(Path.fromPortableString(string).toPortableString())
</pre>
will be true for all strings that represent canonical paths (strings with duplicate slashes
or "." and ".." references will turn out differently). Furthermore, the Eclipse 3.1 implementation of 
<tt>Path.fromPortableString</tt> will be the perfect inverse of the Eclipse 3.0 implementation
of <tt>IPath.toString</tt>.
<p>
On Unix, the <tt>toOSString</tt> and <tt>fromOSString</tt> methods will be
inverses of each other.  On Windows, the same can only be said for paths that
do not contain colon or backslash characters within segment names (such paths
are invalid on Windows anyway). Consider the following example:
<pre>
   String input = "foo::bar";
   IPath pathOne = Path.fromPortableString(input);
   IPath pathTwo = Path.fromOSString(pathOne.toOSString());
   boolean same = pathOne.equals(pathTwo); // false on Windows
</pre>
The input string represents a path with no device, and a single segment whose
name is "foo:bar" (invalid on Windows).  When this is output using <tt>toOSString</tt>,
it is encoded as "foo:bar".  The <tt>fromOSString</tt> then interprets this as
a path with device "foo:" and first segment "bar". Similar mangling occurs if you
create a path with a segment containing the backslash character:
<pre>
   String input = "foo\\bar"; 
   IPath pathOne = Path.fromPortableString(input);
   IPath pathTwo = Path.fromOSString(pathOne.toOSString());
   boolean same = pathOne.equals(pathTwo); // false on Windows
</pre>
In this case, the input is a path with one segment whose name is "foo\bar". This
is interpreted by <tt>fromOSString</tt> as a path with two segments "foo" and "bar".
In other words, under this proposal you cannot reliably manipulate paths containing
backslash or colon using to/fromOSString on Windows. This seems to be an acceptable
limitation.
</p>
<h3>References</h3>
<p>
<ul>
<li><a href="path_migration.html">Draft migration guide entry</a></li>
<li><a href="https://bugs.eclipse.org/bugs/show_bug.cgi?id=24152">bug 24152</a></li>
</ul>
</body>
</html>